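The vision transformer (ViT) splits each image into a sequence of tokens of fixed length and processes the tokens in the same way as words in natural language processing. More tokens normally lead to better performance, but at considerably higher computational cost. Motivated by the proverb "A picture is worth a thousand words", we aim to accelerate ViT models by making a long image short. To this end, we propose a novel approach that assigns token length adaptively during inference. Specifically, we first train a ViT model, called Resizable-ViT (ReViT), that can process any given input with different token lengths. We then retrieve a "token-length label" from ReViT and use it to train a lightweight Token-Length Assigner (TLA). The token-length label is the smallest number of tokens into which an image can be split such that ReViT still makes the correct prediction, and the TLA learns to allocate the optimal token length based on these labels. The TLA enables ReViT to process each image with the minimum sufficient number of tokens during inference. Inference is therefore faster because the number of tokens in the ViT model is reduced. Our approach is general, is compatible with modern vision transformer architectures, and can significantly reduce computational cost. We verify the effectiveness of our method on multiple representative ViT models (DeiT, LV-ViT, and TimeSformer) across two tasks (image classification and action recognition).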
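The token-length label described above can be illustrated with a minimal sketch. It assumes a hypothetical `predict(image, token_length)` callable standing in for a resizable model, and a hypothetical set of candidate token lengths; neither the function names nor the candidate values come from the paper.

```python
from typing import Callable, Optional, Sequence


def token_length_label(
    predict: Callable[[object, int], int],
    image: object,
    true_label: int,
    candidate_lengths: Sequence[int],
) -> Optional[int]:
    """Return the smallest candidate token length at which `predict` is correct.

    Returns None when no candidate length yields the correct prediction.
    """
    for length in sorted(candidate_lengths):
        if predict(image, length) == true_label:
            return length
    return None


# Toy stand-in for a resizable model (hypothetical): each "image" records the
# minimum number of tokens it needs, and the model is correct only above that.
def toy_predict(image: dict, token_length: int) -> int:
    return image["label"] if token_length >= image["min_tokens"] else -1


easy = {"label": 3, "min_tokens": 49}
hard = {"label": 7, "min_tokens": 196}
lengths = [196, 49, 100]  # hypothetical candidate token lengths

easy_label = token_length_label(toy_predict, easy, 3, lengths)
hard_label = token_length_label(toy_predict, hard, 7, lengths)
print(easy_label, hard_label)
```

Labels produced this way could then serve as classification targets for a small assigner network, so that easy images are routed to short token sequences and hard images to long ones.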
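This work investigates the use of batch normalization in neural architecture search (NAS). Specifically, Frankle et al. find that training only the BatchNorm parameters can achieve nontrivial performance. Furthermore, Chen et al. claim that training only BatchNorm can speed up the training of a one-shot NAS supernet by more than ten times. Critically, there has been no effort to understand 1) why training only BatchNorm can find well-performing architectures with reduced supernet training time, and 2) what the difference is between a train-BN-only supernet and a standard-train supernet. We begin by showing that train-BN-only networks converge to the neural tangent kernel regime and theoretically obtain the same training dynamics as training all parameters. Our proof supports the claim that training only BatchNorm on the supernet requires less training time. We then show empirically that a train-BN-only supernet gives convolutions an advantage over other operators, causing unfair competition between architectures. This is because only the convolution operator is attached to BatchNorm. Through experiments, we show that this unfairness makes the search algorithm prone to selecting models with convolutions. To solve this issue, we introduce fairness into the search space by placing a BatchNorm layer on every operator. However, we observe that the performance predictor of Chen et al. is not applicable to the new search space. To this end, we propose a novel composite performance indicator that evaluates networks from three perspectives derived from the theoretical properties of BatchNorm: expressivity, trainability, and uncertainty. We demonstrate the effectiveness of our approach on multiple NAS benchmarks (NAS-Bench-101, NAS-Bench-201) and search spaces (the DARTS search space and the MobileNet search space).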
This paper studies how to flexibly integrate reconstructed 3D models into practical 3D modeling pipelines such as 3D scene creation and rendering. Because 3D reconstruction remains technically difficult, existing techniques yield only rough 3D models (R3DMs) for most real objects. As a result, physically-based rendering (PBR) produces low-quality images or videos for scenes constructed from R3DMs. One promising solution is to represent real-world objects as Neural Fields such as NeRFs, which can generate photo-realistic renderings of an object from desired viewpoints. However, a drawback is that views synthesized through Neural Fields Rendering (NFR) cannot reflect the simulated lighting details on R3DMs in PBR pipelines, especially when object interactions during 3D scene creation cause local shadows. To resolve this dilemma, we propose a lighting transfer network (LighTNet) that bridges NFR and PBR so that they can benefit from each other. LighTNet reasons about a simplified image composition model, remedies the uneven surface issue caused by R3DMs, and is empowered by several perceptually motivated constraints and a new Lab angle loss that enhances the contrast between lighting strength and colors. Comparisons demonstrate that LighTNet is superior in synthesizing impressive lighting and is promising for pushing NFR further into practical 3D modeling workflows. Project page: https://3d-front-future.github.io/LighTNet .
We focus on causal discovery in the presence of measurement error in linear systems where the mixing matrix, i.e., the matrix indicating the independent exogenous noise terms pertaining to the observed variables, is identified up to permutation and scaling of the columns. We demonstrate a somewhat surprising connection between this problem and causal discovery in the presence of unobserved parentless causes, in the sense that there is a mapping, given by the mixing matrix, between the underlying models to be inferred in these problems. Consequently, any identifiability result based on the mixing matrix for one model translates to an identifiability result for the other model. We characterize to what extent the causal models can be identified under a two-part faithfulness assumption. Under only the first part of the assumption (corresponding to the conventional definition of faithfulness), the structure can be learned up to the causal ordering among an ordered grouping of the variables but not all the edges across the groups can be identified. We further show that if both parts of the faithfulness assumption are imposed, the structure can be learned up to a more refined ordered grouping. As a result of this refinement, for the latent variable model with unobserved parentless causes, the structure can be identified. Based on our theoretical results, we propose causal structure learning methods for both models, and evaluate their performance on synthetic data.
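To make the notion of a mixing matrix concrete, here is a minimal sketch for the plain linear acyclic model X = BX + N, for which X = (I - B)^{-1} N and the mixing matrix is W = (I - B)^{-1}. The three-variable chain and its edge weights are hypothetical, and the sketch covers only this error-free case; the measurement-error and unobserved-cause variants discussed in the abstract are not modeled here.

```python
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Multiply two square matrices given as nested lists."""
    n = len(a)
    return [
        [sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
        for i in range(n)
    ]


def mixing_matrix(b: Matrix) -> Matrix:
    """Return (I - B)^{-1} = I + B + B^2 + ... for an acyclic B.

    The series is finite because B is nilpotent when the graph has no cycles.
    """
    n = len(b)
    identity = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    total = [row[:] for row in identity]
    power = [row[:] for row in identity]
    for _ in range(n - 1):
        power = matmul(power, b)
        total = [[total[i][j] + power[i][j] for j in range(n)] for i in range(n)]
    return total


# Hypothetical chain X1 -> X2 -> X3; B[i][j] is the direct effect of Xj on Xi.
B = [
    [0.0, 0.0, 0.0],
    [2.0, 0.0, 0.0],
    [0.0, 3.0, 0.0],
]
W = mixing_matrix(B)
for row in W:
    print(row)
```

Entry W[i][j] is the total effect of noise term j on variable i, so the zero pattern of W records which exogenous noise terms reach which observed variables. Identification "up to permutation and scaling of the columns" means the columns of W are known only in an arbitrary order and with arbitrary nonzero scale factors.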
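The dual-encoder structure successfully uses two language-specific encoders (LSEs) for code-switching speech recognition. Because the LSEs are initialized from two pre-trained language-specific models (LSMs), the dual-encoder structure can exploit abundant monolingual data and capture the attributes of each individual language. However, existing methods place no language constraints on the LSEs and underutilize the language-specific knowledge of the LSMs. In this paper, we propose a language-specific characteristic assistance (LSCA) method to mitigate these problems. Specifically, during training, we introduce two language-specific losses as language constraints and generate the corresponding language-specific targets for them. During decoding, we take the decoding abilities of the LSMs into account by combining the output probabilities of the two LSMs and the mixture model to obtain the final predictions. Experiments show that either the training or the decoding method of LSCA alone improves the model's performance. Furthermore, combining the training and decoding methods of LSCA yields the best result, with up to 15.4% relative error reduction on the code-switching test set. Moreover, with our method the system can handle code-switching speech recognition well without extra shared parameters, or even without retraining, based on two pre-trained LSMs.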
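The decoding step, which combines the output probabilities of the two LSMs with those of the mixture model, can be sketched as follows. The abstract does not state the combination rule, so the weighted average used here, the weights, and the function name `combine_probs` are all assumptions made for illustration.

```python
from typing import Dict


def combine_probs(
    mixture: Dict[str, float],
    lsm_a: Dict[str, float],
    lsm_b: Dict[str, float],
    weight_mixture: float = 0.5,
    weight_lsm: float = 0.25,
) -> Dict[str, float]:
    """Weighted average of three token distributions, renormalized to sum to 1.

    A token missing from a model's vocabulary gets probability 0 from it.
    Assumes at least one token receives a positive combined score.
    """
    vocab = set(mixture) | set(lsm_a) | set(lsm_b)
    scores = {
        tok: weight_mixture * mixture.get(tok, 0.0)
        + weight_lsm * lsm_a.get(tok, 0.0)
        + weight_lsm * lsm_b.get(tok, 0.0)
        for tok in vocab
    }
    total = sum(scores.values())
    return {tok: score / total for tok, score in scores.items()}


# Hypothetical per-step distributions over a tiny vocabulary.
mixture_probs = {"ni_hao": 0.5, "hello": 0.5}   # mixture model is undecided
mandarin_probs = {"ni_hao": 1.0}                # Mandarin LSM is confident
english_probs = {"hello": 0.2, "the": 0.8}      # English LSM prefers another word

combined = combine_probs(mixture_probs, mandarin_probs, english_probs)
best_token = max(combined, key=combined.get)
print(best_token)
```

In this toy case the mixture model alone cannot choose between the two tokens, and the confident monolingual model tips the decision, which is the kind of assistance the abstract attributes to the LSMs during decoding.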
Linear structural causal models (SCMs), in which each observed variable is generated by a subset of the other observed variables as well as a subset of the exogenous sources, are pervasive in causal inference and causal discovery. However, for the task of causal discovery, existing work focuses almost exclusively on the submodel where each observed variable is associated with a distinct source with non-zero variance. This results in the restriction that no observed variable can deterministically depend on other observed variables or latent confounders. In this paper, we extend the results on structure learning by focusing on a subclass of linear SCMs which do not have this property, i.e., models in which observed variables can be causally affected by any subset of the sources, and are allowed to be a deterministic function of other observed variables or latent confounders. This allows for a more realistic modeling of influence or information propagation in systems. We focus on the task of causal discovery from observational data generated by a member of this subclass. We derive a set of necessary and sufficient conditions for unique identifiability of the causal structure. To the best of our knowledge, this is the first work that gives identifiability results for causal discovery under both latent confounding and deterministic relationships. Further, we propose an algorithm for recovering the underlying causal structure when the aforementioned conditions are satisfied. We validate our theoretical results on both synthetic and real datasets.
Benefiting from its ability to exploit intrinsic supervision information, contrastive learning has recently achieved promising performance in deep graph clustering. However, we observe that two drawbacks of the positive and negative sample construction mechanisms prevent existing algorithms from improving further. 1) The quality of positive samples depends heavily on carefully designed data augmentations, while inappropriate data augmentations easily lead to semantic drift and indiscriminative positive samples. 2) The constructed negative samples are unreliable because they ignore important clustering information. To solve these problems, we propose a Cluster-guided Contrastive deep Graph Clustering network (CCGC) that mines the intrinsic supervision information in the high-confidence clustering results. Specifically, instead of conducting complex node or edge perturbation, we construct two views of the graph by designing special Siamese encoders whose weights are not shared between the sibling sub-networks. Then, guided by the high-confidence clustering information, we carefully select and construct the positive samples from the same high-confidence cluster in the two views. Moreover, to construct semantically meaningful negative sample pairs, we regard the centers of different high-confidence clusters as negative samples, thus improving the discriminative capability and reliability of the constructed sample pairs. Lastly, we design an objective function that pulls together samples from the same cluster and pushes away those from other clusters by maximizing the cross-view cosine similarity between positive samples and minimizing it between negative samples. Extensive experimental results on six datasets demonstrate the effectiveness of CCGC compared with existing state-of-the-art algorithms.
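The objective described above, which raises cross-view cosine similarity for positive pairs and lowers it between centers of different clusters, can be sketched in a few lines. The exact loss form is not given in the abstract, so the sum of a positive term `1 - cos` and a negative term `cos` used here, together with the function names, is an assumption for illustration only.

```python
import math
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def toy_cluster_contrastive_loss(
    positive_pairs: List[Tuple[Vector, Vector]],
    centers_view1: List[Vector],
    centers_view2: List[Vector],
) -> float:
    """Assumed loss: mean(1 - cos) over positives + mean(cos) over negatives.

    Positives are cross-view embeddings from the same high-confidence cluster.
    Negatives are cross-view centers of *different* clusters.
    """
    pos_term = sum(1.0 - cosine(a, b) for a, b in positive_pairs) / len(positive_pairs)
    negatives = [
        cosine(c1, c2)
        for i, c1 in enumerate(centers_view1)
        for j, c2 in enumerate(centers_view2)
        if i != j
    ]
    neg_term = sum(negatives) / len(negatives)
    return pos_term + neg_term


# Hypothetical 2-D embeddings: aligned positives, orthogonal cluster centers.
good_loss = toy_cluster_contrastive_loss(
    positive_pairs=[([1.0, 0.0], [2.0, 0.0])],
    centers_view1=[[1.0, 0.0], [0.0, 1.0]],
    centers_view2=[[1.0, 0.0], [0.0, 1.0]],
)
# Misaligned positives and identical centers should score worse.
bad_loss = toy_cluster_contrastive_loss(
    positive_pairs=[([1.0, 0.0], [0.0, 1.0])],
    centers_view1=[[1.0, 0.0], [1.0, 0.0]],
    centers_view2=[[1.0, 0.0], [1.0, 0.0]],
)
print(good_loss, bad_loss)
```

Using cluster centers rather than individual nodes as negatives means each negative pair is guaranteed to span two different clusters, which is the reliability argument made in the abstract.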
To generate high-quality rendered images for real-time applications, it is common to trace only a few samples per pixel (spp) at a lower resolution and then supersample to the high resolution. Based on the observation that pixels rendered at a low resolution are typically highly aliased, we present a novel method for neural supersampling based on ray tracing 1/4-spp samples at the high resolution. Our key insight is that the ray-traced samples at the target resolution are accurate and reliable, which turns supersampling into an interpolation problem. We present a mask-reinforced neural network to reconstruct and interpolate high-quality image sequences. First, a novel temporal accumulation network is introduced to compute the correlation between current and previous features, significantly improving their temporal stability. Then a reconstruction network based on a multi-scale U-Net with skip connections is adopted to reconstruct and generate the desired high-resolution image. Experimental results and comparisons show that our method generates higher-quality supersampling results than current state-of-the-art methods without increasing the total number of ray-tracing samples.
Temporal sentence grounding (TSG) aims to identify the temporal boundary of a specific segment in an untrimmed video given a sentence query. All existing works first use a sparse sampling strategy to extract a fixed number of video frames and then conduct multi-modal interactions with the query sentence for reasoning. However, we argue that these methods overlook two important issues: 1) Boundary-bias: The annotated target segment generally refers to two specific frames as the corresponding start and end timestamps. The video downsampling process may lose these two frames and take adjacent irrelevant frames as the new boundaries. 2) Reasoning-bias: Such incorrect new boundary frames also lead to reasoning bias during frame-query interaction, reducing the generalization ability of the model. To alleviate these limitations, we propose a novel Siamese Sampling and Reasoning Network (SSRN) for TSG, which introduces a siamese sampling mechanism to generate additional contextual frames that enrich and refine the new boundaries. Specifically, a reasoning strategy is developed to learn the inter-relationship among these frames and generate soft labels on the boundaries for more accurate frame-query reasoning. This mechanism can also supplement the sparsely sampled frames with the missing consecutive visual semantics for fine-grained activity understanding. Extensive experiments demonstrate the effectiveness of SSRN on three challenging datasets.
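The boundary-bias issue can be made concrete with a small sketch: after uniform sparse sampling, the annotated start and end frames may no longer be among the sampled frames, so the nearest sampled frames become the new boundaries. The video length, sample count, annotation, and the center-of-bin sampling rule below are hypothetical choices, not details taken from the paper.

```python
from typing import List


def uniform_sample(num_frames: int, num_samples: int) -> List[int]:
    """Pick `num_samples` frame indices at the centers of equal-width bins."""
    step = num_frames / num_samples
    return [int(step * i + step / 2) for i in range(num_samples)]


def nearest_sampled(sampled: List[int], frame: int) -> int:
    """Return the sampled frame index closest to `frame`."""
    return min(sampled, key=lambda s: abs(s - frame))


# Hypothetical 100-frame video, 5 sampled frames, annotated segment [22, 58].
sampled_frames = uniform_sample(num_frames=100, num_samples=5)
annotated_start, annotated_end = 22, 58

new_start = nearest_sampled(sampled_frames, annotated_start)
new_end = nearest_sampled(sampled_frames, annotated_end)

print("sampled frames:", sampled_frames)
print("annotated boundary:", (annotated_start, annotated_end))
print("boundary after sampling:", (new_start, new_end))
```

Here both annotated boundary frames are dropped by the sampler and the segment shrinks by eight frames on each side, which is the kind of shift that additional contextual frames around the sampled boundaries are meant to compensate for.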
Representing and synthesizing novel views in real-world dynamic scenes from casual monocular videos is a long-standing problem. Existing solutions typically approach dynamic scenes by applying geometry techniques or utilizing temporal information between several adjacent frames without considering the underlying background distribution in the entire scene or the transmittance over the ray dimension, limiting their performance on static and occlusion areas. Our approach $\textbf{D}$istribution-$\textbf{D}$riven neural radiance fields offers high-quality view synthesis and a 3D solution to $\textbf{D}$etach the background from the entire $\textbf{D}$ynamic scene, which is called $\text{D}^4$NeRF. Specifically, it employs a neural representation to capture the scene distribution in the static background and a 6D-input NeRF to represent dynamic objects, respectively. Each ray sample is given an additional occlusion weight to indicate the transmittance lying in the static and dynamic components. We evaluate $\text{D}^4$NeRF on public dynamic scenes and our urban driving scenes acquired from an autonomous-driving dataset. Extensive experiments demonstrate that our approach outperforms previous methods in rendering texture details and motion areas while also producing a clean static background. Our code will be released at https://github.com/Luciferbobo/D4NeRF.